Exact and Approximate Methods for Data Directed Microaggregation in One or More Dimensions

Author

  • Gordon Sande
Abstract

Microaggregation is a technique used to protect the confidentiality of respondents in microdata releases. It is typically used for economic data, where respondent identifiability is quite high. Rather than releasing a perturbed version of the data, microaggregation releases the averages of small groups in which no single respondent is dominant. The original form of microaggregation was for univariate data. It was implemented by sorting the data and then reporting the averages of adjacent groups of fixed size. Any partial group at the end would be pooled with the final complete group to ensure that the desired minimum group size was obtained. The typical group size was small, with five a common choice. An immediate improvement would be to allow some number of internal groups, perhaps near the center of the data, to be larger to compensate for the incomplete group. As a further improvement, the groups can be allowed to have varying sizes so that no group will include a large gap in the sorted data. Each of the resulting groups can be more homogeneous when the group boundaries are allowed to be sensitive to the distribution of the data. This can be described as a clustering problem with a variable number of clusters and a minimum cluster size. The number of clusters is chosen to be as large as possible consistent with homogeneous clusters and the minimum cluster size. Techniques for determining such data directed microaggregations have been proposed which use randomized searching methods. These methods are typically terminated early, as they are quite expensive to operate. They seek to minimize the total within-cluster sum of squares, as suggested by some clustering methods. This criterion has two disadvantages: it does not lead to readily solved optimization problems, and it is not the most suitable criterion for the highly skewed data typical of economic applications. For highly skewed data the width of the clusters may be a more suitable measure.
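The original fixed-size univariate procedure described in this abstract (sort, average adjacent groups of size k, pool any leftover records into the final complete group) might be sketched as follows. This is a minimal illustration, not the author's code; the function name `fixed_size_microaggregate` and its parameters are assumptions made for this example.

```python
def fixed_size_microaggregate(values, k=5):
    """Replace each value by the mean of its fixed-size group of sorted neighbours.

    Hypothetical helper for illustration: groups are formed from the sorted
    data, and any partial group at the end is pooled with the last complete
    group so that every group has at least k members.
    """
    n = len(values)
    if n < k:
        raise ValueError("need at least k values to form one group")

    # Indices of the records in ascending order of value.
    order = sorted(range(n), key=lambda i: values[i])
    n_groups = n // k
    result = [0.0] * n

    for g in range(n_groups):
        start = g * k
        # The final group absorbs the leftover records.
        end = n if g == n_groups - 1 else start + k
        members = order[start:end]
        mean = sum(values[i] for i in members) / len(members)
        for i in members:
            result[i] = mean  # released value, in the original record order
    return result
```

With seven records and k = 3, the first three sorted values form one group and the remaining four are pooled into the second, so `fixed_size_microaggregate([7, 1, 4, 2, 6, 3, 5], k=3)` releases 2.0 for the records holding 1, 2, 3 and 5.5 for the records holding 4, 5, 6, 7.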
The total within-cluster width may be obtained by summing the gaps between adjacent members of the clusters. Cluster size may be controlled by requiring that a minimum number of adjacent gaps be included in any cluster. The result is an optimization problem for a linear objective function over the indicator variables for the gap inclusions. Each data point and its potential cluster neighbors would appear in a constraint which enforces the minimum cluster size. The resulting system can be readily
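To make the width criterion concrete, the sketch below minimizes the total within-cluster width (the sum of included gaps, which equals each cluster's maximum minus its minimum) subject to a minimum cluster size. It uses a simple dynamic program over the sorted data rather than the indicator-variable formulation mentioned in the abstract, so it should be read as a swapped-in illustration of the same objective and not as the author's method. The name `min_width_partition` is hypothetical.

```python
import math


def min_width_partition(values, k):
    """Partition sorted data into contiguous clusters of size >= k,
    minimizing the sum of cluster widths (max - min of each cluster).

    Illustrative O(n^2) dynamic program; an assumption of this sketch,
    not the formulation used in the paper.
    """
    xs = sorted(values)
    n = len(xs)
    if n < k:
        raise ValueError("need at least k values to form one cluster")

    # best[i]: minimal total width for the first i sorted points.
    best = [math.inf] * (n + 1)
    prev = [-1] * (n + 1)  # start index of the last cluster in that optimum
    best[0] = 0.0

    for i in range(k, n + 1):
        # The last cluster is xs[j:i]; it must hold at least k points.
        for j in range(0, i - k + 1):
            if best[j] == math.inf:
                continue  # the first j points cannot be validly partitioned
            cost = best[j] + (xs[i - 1] - xs[j])
            if cost < best[i]:
                best[i] = cost
                prev[i] = j

    # Walk the back-pointers to recover the clusters.
    clusters = []
    i = n
    while i > 0:
        j = prev[i]
        clusters.append(xs[j:i])
        i = j
    clusters.reverse()
    return best[n], clusters
```

For the data 1, 2, 3, 4, 10, 11, 12 with k = 3, fixed-size grouping would place 4 in the same group as 10, 11, 12, spanning the large gap. The width criterion instead yields the clusters [1, 2, 3, 4] and [10, 11, 12], with total width 3 + 2 = 5.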

Similar resources

A novel local search method for microaggregation

In this paper, we propose an effective microaggregation algorithm to produce more useful protected data for publishing. Microaggregation is mapped to a clustering problem with known minimum and maximum group size constraints. In this scheme, the goal is to cluster n records into groups of at least k and at most 2k-1 records, such that the sum of the within-group squ...

Improved Univariate Microaggregation for Integer Values

Privacy issues during data publishing are an increasing concern of involved entities. The problem is addressed in the field of statistical disclosure control with the aim of producing protected datasets that are also useful for interested end users such as government agencies and research communities. The problem of producing useful protected datasets is addressed in multiple computational priva...

Repeated Record Ordering for Constrained Size Clustering

One of the main techniques used in data mining is data clustering, which has many applications in computer science, biology, and social sciences. Constrained clustering is a type of clustering in which side information provided by the user is incorporated into current clustering algorithms. One of the well researched constrained clustering algorithms is called microaggregation. In a microaggreg...

Which is effective: self-directed learning or tutor-directed learning on the level of nursing skills

Introduction. This is a quasi-experimental study to determine and compare the level of learning of nursing skills (in B.A. students) under self-directed and tutor-directed learning patterns at Shaheed Beheshti University of Medical Sciences and Health Services, Nursing and Midwifery faculty, 1998-1999. Methods. First of all, a questionnaire composed of some demographic data such ...

Performance Analysis of Device to Device Communications Overlaying/Underlaying Cellular Network

Minimizing the outage probability and maximizing throughput are two important aspects in device to device (D2D) communications, which are greatly related to each other. In this paper, first, the exact formulas of the outage probability for D2D communications underlaying or overlaying cellular network are derived which jointly experience Additive White Gaussian Noise (AWGN) and Rayleigh multipat...

Journal title:
  • International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems

Volume 10  Issue 

Pages  -

Publication date 2002